-
Notifications
You must be signed in to change notification settings - Fork 8.2k
tests: posix: stress test for pthread_create and pthread_join #58218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just had a few questions before I click the approve button. I don't have any concrete suggestions for changes.
|
Rebased on top of several recent RISC-V changes and it is showing definite stability improvements on this test. I'm going to enable it now. I am unfortunately still seeing major issues with |
|
@hakehuang, @keith-packard - PTAL |
|
Interesting benchmarks as well. |
|
@hakehuang @keith-packard - PTAL again |
d8a8fe6 to
fb15a1c
Compare
Thanks - yeah, I always forget that CC @stephanosio - might need your merge superpowers to get past the false positive. |
|
@keith-packard note that all tests on SMP platforms get retried twice in CI. There are known glitches within timing between SMP cores that can't be worked around, and that affect some of the naive timing logic in many of the tests (you can have one CPU get halted for 10ms+ because the host OS didn't schedule it, etc...). So if you're only seeing a handful of failures, that's probably not going to be detectable by CI. Just run twister with "--only-failed" after your run and verify that they succeed reliably the second time (they almost always do, because the multicore host is now less loaded with no builds going on). FWIW: if the frequency can be measured and it's going up with this PR, though, we should track that somewhere. The kernel bug that was opened is as good a place as any if you have numbers like that. |
I'm getting a repeatable failure with qemu_x86_64 -- did you intend to exclude that platform as well? Or am I hitting an unexpected problem? It's the 'aborted _current back from dead' trouble in |
|
FWIW: @carlocaione found a fix for the kernel bug in #58116. It's just a one-liner you could try right now. I'm working up a slightly more gold plated version, but just one wait_for_switch() call was all that was missing. |
Yeah, I do try to remember to run a second time with -f to see if that's the trouble. In this case, having each of the two internal tests run for 29 seconds while the overall timeout was set at 60 seconds was causing spurious failure more times than not, leading me to be concerned that we'd hit failures even with retries.
I'll see if I can find some time to poke at that again. |
@keith-packard - I could exclude that platform too if it becomes problematic. IIRC, we have |
Yeah, and CI is happy. Ok, I think this is 'good enough' and will let us tweak it to validate any core kernel fixes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need fix the compliance issue. Others looks good to me
looks like the compliance issue isn't real? Just using |
|
@stephanosio - you should be ok to merge this I think - PTAL |
|
@jgl-meta - also, PTAL. I would suggest running this a few times on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding __unused attribute to int ret, I would suggest using if (IS_ENABLED(...)) instead of #ifdef so that the compiler is aware of the execution path that uses ret.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea... I was really on the fence there. In theory, the linker should discard any dead code, and perhaps it's just me being paranoid, but I found that the #ifdef seemed to result in fewer failed assertions.
I'll respin. Ultimately though, I want @jgl-meta to make the call whether this workaround / test is merged first or whether he will hold out for pthreads to be water tight.
The latter is ideal, but there is not always time for the ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea... I was really on the fence there. In theory, the linker should discard any dead code, and perhaps it's just me being paranoid, but I found that the #ifdef seemed to result in fewer failed assertions.
Yes, compiler should optimise out any dead code. Linux kernel coding style also recommends using IS_ENABLED instead of preprocessor conditionals whenever possible for this reason -- maybe it is about the time we document this in the Zephyr coding guidelines ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made @stephanosio 's suggestions.
This reverts commit 7c17bda. Signed-off-by: Christopher Friedt <[email protected]>
The `aborted _current back from dead` error may appear in this particular test under Qemu (and in particular Qemu SMP) systems. A small delay between `pthread_create()` and `pthread_join()` is sufficient to mitigate the issue. Signed-off-by: Christopher Friedt <[email protected]>
Recently, a race condition was discovered in `pthread_create()` and `pthread_join()` that would trigger an assertion in `kernel/sched.c` of the form below. ``` aborted _current back from dead ``` This was mainly observed in Qemu. Unfortunately, Qemu SMP platforms exhibit a real and measurable "scheduler noise" which makes testing rather difficult. Signed-off-by: Christopher Friedt <[email protected]>
Recently, a race condition was discovered in
pthread_create()andpthread_join()that would trigger an assertion inkernel/sched.cof the form below.This was mainly observed in Qemu. Unfortunately, Qemu SMP platforms exhibit a real and measurable "scheduler noise" which makes testing rather difficult.
Additional changes: